"Alone we can do so little; together we can do so much." — Helen Keller
The SmithBusters are a group of aspiring MSIS graduate students who are enthusiastic about data and want to emphasize the importance of data analytics in today's data-driven world. We are ready to apply and present the knowledge we have gained as data analysts, both inside and outside of the University of Maryland, in order to make a difference in our communities.
The SmithBusters are Rishikesh Baskaran, Yash Makadia, Satvik Narang, Anushka Ranjan and Shweta Salelkar.
The SmithBusters will produce visualizations using data from the US Census to help us identify places that require assistance in any of the characteristics specified, such as education and employment. We will make predictions about how the numerous criteria we investigate impact the income and livelihood of Americans in various ways. The SmithBusters intend to answer the following questions with this data.
In order to build our community needs index, we will need to evaluate answers to the following questions:
Understanding which states are doing well and which fall short of the national average requires knowing the median income of American families and how it varies from state to state. Our research will focus on high-income states to learn the employment structure that has contributed to their success. The states trailing behind then become our focus in the later parts of our report.
A popular belief is that better education leads to more job opportunities and a higher income. We're here to find out if that's true by comparing the levels of education in the state with the highest income and the state with the lowest. If level of education and income are directly related, education becomes an obvious factor to focus on.
We will need a working model to put our analysis to use, telling us right away from the predicted income whether a state has room for improvement.
When our model does show that there is room for change, we will need to know which factors to start working on right away and how important each one is so that organizations can decide what to work on first.
Our data was obtained from the US Census data website (https://data.census.gov/). We collected the following tables for the year 2020:
| Table Name | Description |
|---|---|
| DP03 | Economic characteristics |
| DP05 | Demographic and housing estimates |
| S1501 | Educational attainment |
| S2506 | Financial characteristics for housing units |
Each of these tables contains data for every zip code in the US. To join the relevant columns from these tables, we use the zip code as the primary key, creating a single consolidated dataset.
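As a minimal sketch of this join (made-up values, and a hypothetical `ZipCode` key; the notebook itself joins on the `NAME` column):

```python
import pandas as pd

# Hypothetical excerpts of two source tables, each keyed by zip code
economic = pd.DataFrame({"ZipCode": ["00601", "00602"], "Income": [14398, 16771]})
education = pd.DataFrame({"ZipCode": ["00601", "00602"], "HighSchoolGrad": [3316, 7103]})

# An inner join keeps only zip codes present in both tables
merged = pd.merge(economic, education, on="ZipCode")
print(merged.shape)  # (2, 3)
```

Repeating this merge once per source table yields one wide row per zip code.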
The columns we chose for our analysis were driven by our objective: we needed to understand which factors affect an individual's income and how that income affects their living conditions. DP03 gave us data on income and income characteristics for the population, and DP05 covered population characteristics and ethnicity. We used S1501 to understand levels of educational attainment: how many people in a city were educated up to the high-school, college, and undergraduate levels. S2506 gave us two columns describing the number of housing units with a mortgage and the median monthly housing cost.
The SmithBusters believe that we should be graded more strongly on our data processing. In order to perform our analysis and build the community needs index, we had to carry out a number of cleaning, processing, and transformation steps on the dataset.
Apart from the dataset being high-dimensional, we had to ensure that relevant factors were considered, which involved creating a new dataset by merging multiple tables.
We also created high-level factors (such as jobs) needed by the community by assigning weights to sub-factors (such as public, private, and self-employed jobs), and converted the absolute numbers provided for these factors into proportions so that geographies of different sizes can be compared.
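As a minimal sketch of this weighting step (the column names, weights, and numbers here are illustrative, not the ones used later in the notebook):

```python
import pandas as pd

# Hypothetical counts for two areas of different sizes
df = pd.DataFrame({
    "TotalPop": [1000, 2000],
    "PrivateWork": [400, 900],
    "PublicWork": [100, 300],
    "SelfEmployed": [50, 100],
})

# Convert absolute counts to percentages so areas of different size are comparable
for col in ["PrivateWork", "PublicWork", "SelfEmployed"]:
    df[col] = df[col] / df["TotalPop"] * 100

# Equal illustrative weights for the three sub-factors (weights sum to 1)
weights = {"PrivateWork": 1/3, "PublicWork": 1/3, "SelfEmployed": 1/3}
df["Jobs"] = sum(df[col] * w for col, w in weights.items())
print(df["Jobs"].round(2).tolist())  # [18.33, 21.67]
```

The same pattern (percentage conversion, then a weighted sum) rolls any set of sub-factors up into one comparable factor.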
# Importing all the required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
# Reading the data reference csv containing list of columns and source tables to be used from the US Census website
df1=pd.read_csv(r'.\data\DataReference.csv')
#creating a list of respective columns to be filtered from each of the source tables
cols_DP03=df1[df1['TabName'].str.contains('DP03')].values[:,1].tolist()
cols_DP05=df1[df1['TabName'].str.contains('DP05')].values[:,1].tolist()
cols_S2506=df1[df1['TabName'].str.contains('S2506')].values[:,1].tolist()
cols_S1501=df1[df1['TabName'].str.contains('S1501')].values[:,1].tolist()
# The 'NAME' column will be used as the key to join the data
cols_DP03.insert(0,'NAME')
cols_DP05.insert(0,'NAME')
cols_S2506.insert(0,'NAME')
cols_S1501.insert(0,'NAME')
# importing the csv files downloaded from US Census website
df_DP05=pd.read_csv(r'.\data\ACSDP5Y2020.DP05-Data.csv', dtype='str',header=0)
df_DP03=pd.read_csv(r'.\data\ACSDP5Y2020.DP03-Data.csv', dtype='str',header=0)
df_S2506=pd.read_csv(r'.\data\ACSST5Y2020.S2506-Data.csv', dtype='str',header=0)
df_S1501=pd.read_csv(r'.\data\ACSST5Y2020.S1501-Data.csv', dtype='str',header=0)
# The first data row repeats the column descriptions (the header spans two rows in the raw files), so we drop it
df_DP05.drop([0], axis=0, inplace=True)
df_DP03.drop([0], axis=0, inplace=True)
df_S2506.drop([0], axis=0, inplace=True)
df_S1501.drop([0], axis=0, inplace=True)
# The final dataframe consists of the individual dataframes joined together on the key 'NAME' (which stores the zip code)
df_merged=pd.merge(df_DP03[cols_DP03], df_DP05[cols_DP05], on='NAME')
df_merged=pd.merge(df_merged,df_S2506[cols_S2506], on='NAME')
df_merged=pd.merge(df_merged,df_S1501[cols_S1501], on='NAME')
The data we collected from the four tables was high-dimensional. The table below summarizes the four source tables. The number of rows in each table was 33122, which is the number of zip codes in America.
| Sr | Table | Description |
|---|---|---|
| 1 | DP03 | Economic characteristics |
| 2 | DP05 | Demographic and housing estimates |
| 3 | S1501 | Educational attainment |
| 4 | S2506 | Financial characteristics for housing units |
As part of our analysis, the final data frame created after merging the relevant columns consists of 33120 rows and 40 columns. In the next step, we will perform the following activities:
1. Renaming the columns:
The column names imported from the CSV files from the US Census website are in a coded format (e.g., DP05_0001E). Using the metadata information (also available from the website), we renamed the columns according to the descriptions provided in the metadata.
2. Extracting Zip code values using regex:
The column 'NAME' contains the zip code prefixed with a ZIP Code Tabulation Area code (e.g., ZCTA5 00601). We use a regular expression to extract the 5 numeric digits from this column, which represent the zip code.
3. Adding City and State information with Zipcode:
We will add the city and state information about the zipcodes in order to evaluate the mean of different parameters (such as Income, Education etc) at a geographical level and compare the results with the zip code. This will help in identifying how a particular zip code is performing when compared to the city or state average.
4. Identification of columns with null/inconsistent values:
For our analysis, we came up with relevant factors such as average income, education, and financial details of a person in a zip code. From our list of factors, we identified two, 'Poverty' and 'Child Poverty', for which no data was available for any zip code. These columns were therefore dropped from the analysis.
5. Identification of rows with null/inconsistent values:
We analyzed the 'Income' column and found that 1669 zip codes contain inconsistent values such as '*' or '-'. This amounts to ~5% of the total rows, so we decided to exclude them from our analysis by dropping them.
6. Data type conversion for numeric columns:
In order to perform aggregations such as averages across all the zip codes in a state, we need to convert the data type of these columns from string to integer.
7. Transformation of absolute values to percentages
In order to compare factors such as education and employment across different geographical areas, we need to transform the absolute values (estimates of the count of people for a given factor) into proportions by dividing them by the total population of that area.
8. Merging Sub-factors to create Factors using Weighted averages
Our last step in managing our data is to combine the multiple sub-factors of a particular category into a single factor using weighted averages. Each state's data will then be compared with the national average to determine which factors fall short and need improvement.
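The comparative step in point 8 can be sketched as follows (the state name and factor values are fabricated for illustration):

```python
import pandas as pd

# Hypothetical factor values (percentages) for one state and the national average
factors = pd.DataFrame(
    {"Education": [28.0, 35.0], "Employment": [58.0, 61.0]},
    index=["SomeState", "National"],
)

# Percentage-point gap to the national average; negative values flag
# factors where the state trails and has room for improvement
gap = factors.loc["SomeState"] - factors.loc["National"]
needs_improvement = gap[gap < 0].index.tolist()
print(needs_improvement)  # ['Education', 'Employment']
```

Ranking the gaps by magnitude then tells organizations which factor to work on first.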
# To check the name of all the columns of the data frame
df_merged.columns
Index(['NAME', 'DP03_0062E', 'DP03_0087M', 'DP03_0088E', 'DP03_0088M',
'DP03_0128E', 'DP03_0129E', 'DP03_0027E', 'DP03_0028E', 'DP03_0029E',
'DP03_0034E', 'DP03_0031E', 'DP03_0019E', 'DP03_0020E', 'DP03_0021E',
'DP03_0022E', 'DP03_0023E', 'DP03_0024E', 'DP03_0025E', 'DP03_0001E',
'DP03_0047E', 'DP03_0048E', 'DP03_0049E', 'DP03_0050E', 'DP03_0005E',
'DP05_0001E', 'DP05_0002E', 'DP05_0003E', 'DP05_0072E', 'DP05_0037E',
'DP05_0065E', 'DP05_0066E', 'DP05_0067E', 'DP05_0068E', 'DP05_0087E',
'S2506_C01_001E', 'S2506_C02_040E', 'S1501_C01_009E', 'S1501_C01_011E',
'S1501_C01_015E'],
dtype='object')
df_merged.shape
(33120, 40)
Now we will rename the columns based on the reference table. As there are 30+ columns, we convert the reference data frame into a dictionary to obtain the mapping from coded column names to descriptive names.
# To convert the reference data frame into a dictionary mapping each coded column name to its descriptive name
pd.Series(df1.ColName.values,index=df1.TabName).to_dict()
{'DP05_0001E': 'TotalPop',
'DP05_0002E': 'Men',
'DP05_0003E': 'Women',
'DP05_0072E': 'Hispanic',
'DP05_0037E': 'White',
'DP05_0065E': 'Black',
'DP05_0066E': 'Native',
'DP05_0067E': 'Asian',
'DP05_0068E': 'Pacific',
'DP05_0087E': 'Citizen',
'DP03_0062E': 'Income',
'DP03_0087M': 'IncomeErr',
'DP03_0088E': 'IncomePerCap',
'DP03_0088M': 'IncomePerCapErr',
'DP03_0128E': 'Poverty',
'DP03_0129E': 'ChildPoverty',
'DP03_0027E': 'Professional',
'DP03_0028E': 'Service',
'DP03_0029E': 'Office',
'DP03_0034E': 'Construction',
'DP03_0031E': 'Production',
'DP03_0019E': 'Drive',
'DP03_0020E': 'Carpool',
'DP03_0021E': 'Transit',
'DP03_0022E': 'Walk',
'DP03_0023E': 'OtherTransp',
'DP03_0024E': 'WorkAtHome',
'DP03_0025E': 'MeanCommute',
'DP03_0001E': 'Employed',
'DP03_0047E': 'PrivateWork',
'DP03_0048E': 'PublicWork',
'DP03_0049E': 'SelfEmployed',
'DP03_0050E': 'FamilyWork',
'DP03_0005E': 'Unemployment',
'S2506_C01_001E': 'Housing with Mortgage',
'S2506_C02_040E': 'Median Monthly housing costs',
'S1501_C01_009E': 'High School Graduate',
'S1501_C01_011E': 'Some College or Associates degree',
'S1501_C01_015E': 'Bachelors degree or higher'}
# To rename the columns names
df_merged.rename(columns = {'DP05_0001E': 'TotalPop',
'DP05_0002E': 'Men',
'DP05_0003E': 'Women',
'DP05_0072E': 'Hispanic',
'DP05_0037E': 'White',
'DP05_0065E': 'Black',
'DP05_0066E': 'Native',
'DP05_0067E': 'Asian',
'DP05_0068E': 'Pacific',
'DP05_0087E': 'Citizen',
'DP03_0062E': 'Income',
'DP03_0087M': 'IncomeErr',
'DP03_0088E': 'IncomePerCap',
'DP03_0088M': 'IncomePerCapErr',
'DP03_0128E': 'Poverty',
'DP03_0129E': 'ChildPoverty',
'DP03_0027E': 'Professional',
'DP03_0028E': 'Service',
'DP03_0029E': 'Office',
'DP03_0034E': 'Construction',
'DP03_0031E': 'Production',
'DP03_0019E': 'Drive',
'DP03_0020E': 'Carpool',
'DP03_0021E': 'Transit',
'DP03_0022E': 'Walk',
'DP03_0023E': 'OtherTransp',
'DP03_0024E': 'WorkAtHome',
'DP03_0025E': 'MeanCommute',
'DP03_0001E': 'Employed',
'DP03_0047E': 'PrivateWork',
'DP03_0048E': 'PublicWork',
'DP03_0049E': 'SelfEmployed',
'DP03_0050E': 'FamilyWork',
'DP03_0005E': 'Unemployment',
'S2506_C01_001E': 'Housing with Mortgage',
'S2506_C02_040E': 'Median Monthly housing costs',
'S1501_C01_009E': 'High School Graduate',
'S1501_C01_011E': 'Some College or Associates degree',
'S1501_C01_015E': 'Bachelors degree or higher'
},inplace=True)
# To verify the column name of the data frame
df_merged.head()
| NAME | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | ... | Black | Native | Asian | Pacific | Citizen | Housing with Mortgage | Median Monthly housing costs | High School Graduate | Some College or Associates degree | Bachelors degree or higher | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ZCTA5 00601 | 14398 | 2379 | 7256 | 639 | (X) | (X) | 1005 | 1088 | 587 | ... | 427 | 102 | 2 | 0 | 13421 | 596 | 747 | 3316 | 1290 | 1868 |
| 1 | ZCTA5 00602 | 16771 | 2297 | 9981 | 594 | (X) | (X) | 2828 | 2455 | 2654 | ... | 6708 | 4740 | 17 | 0 | 30430 | 1908 | 855 | 7103 | 3061 | 5575 |
| 2 | ZCTA5 00603 | 15786 | 3520 | 11794 | 1030 | (X) | (X) | 3654 | 2615 | 2940 | ... | 2026 | 300 | 74 | 14 | 36770 | 3528 | 814 | 9627 | 2558 | 7653 |
| 3 | ZCTA5 00606 | 14980 | 2954 | 6901 | 862 | (X) | (X) | 150 | 316 | 237 | ... | 168 | 22 | 0 | 0 | 5007 | 174 | 549 | 1385 | 201 | 482 |
| 4 | ZCTA5 00610 | 20167 | 3137 | 10643 | 832 | (X) | (X) | 1787 | 1942 | 1910 | ... | 4229 | 2926 | 9 | 1 | 21725 | 1868 | 733 | 5966 | 2166 | 3292 |
5 rows × 40 columns
First, we extract the five-digit zip code from the 'NAME' column so it can serve as the join key.
# To extract the Zip Code from the NAME column and giving it a new column name 'ZipCode'
df_merged['ZipCode'] = df_merged['NAME'].str.findall(r'\d{5}')
# To get the first element of the list of the Zip Code column
df_merged.ZipCode = df_merged.ZipCode.map(lambda x: x[0])
We then add city and state information for each zip code in order to evaluate the mean of different parameters (such as income and education) at a geographical level and compare the results with the zip code. This will help identify how a particular zip code performs compared to the city or state average.
# To add city and state information and map with the zip code
city_state_mapping_df = pd.read_csv(r'.\data\uszips.csv', dtype='str', header=0)
# To map the Zip code, city and state name
city_state_mapping_df = city_state_mapping_df[['zip','city','state_name']]
# To rename the Zip Code, State and City columns
city_state_mapping_df.rename(columns = {'zip':'ZipCode', 'city':'City', 'state_name':'State' }, inplace = True)
# To merge city and state based on the Zip code
df_merged = pd.merge(df_merged , city_state_mapping_df, on='ZipCode')
# To display the first five rows
df_merged.head()
| NAME | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | ... | Pacific | Citizen | Housing with Mortgage | Median Monthly housing costs | High School Graduate | Some College or Associates degree | Bachelors degree or higher | ZipCode | City | State | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ZCTA5 00601 | 14398 | 2379 | 7256 | 639 | (X) | (X) | 1005 | 1088 | 587 | ... | 0 | 13421 | 596 | 747 | 3316 | 1290 | 1868 | 00601 | Adjuntas | Puerto Rico |
| 1 | ZCTA5 00602 | 16771 | 2297 | 9981 | 594 | (X) | (X) | 2828 | 2455 | 2654 | ... | 0 | 30430 | 1908 | 855 | 7103 | 3061 | 5575 | 00602 | Aguada | Puerto Rico |
| 2 | ZCTA5 00603 | 15786 | 3520 | 11794 | 1030 | (X) | (X) | 3654 | 2615 | 2940 | ... | 14 | 36770 | 3528 | 814 | 9627 | 2558 | 7653 | 00603 | Aguadilla | Puerto Rico |
| 3 | ZCTA5 00606 | 14980 | 2954 | 6901 | 862 | (X) | (X) | 150 | 316 | 237 | ... | 0 | 5007 | 174 | 549 | 1385 | 201 | 482 | 00606 | Maricao | Puerto Rico |
| 4 | ZCTA5 00610 | 20167 | 3137 | 10643 | 832 | (X) | (X) | 1787 | 1942 | 1910 | ... | 1 | 21725 | 1868 | 733 | 5966 | 2166 | 3292 | 00610 | Anasco | Puerto Rico |
5 rows × 43 columns
df_merged.columns
Index(['NAME', 'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',
'Poverty', 'ChildPoverty', 'Professional', 'Service', 'Office',
'Construction', 'Production', 'Drive', 'Carpool', 'Transit', 'Walk',
'OtherTransp', 'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork',
'PublicWork', 'SelfEmployed', 'FamilyWork', 'Unemployment', 'TotalPop',
'Men', 'Women', 'Hispanic', 'White', 'Black', 'Native', 'Asian',
'Pacific', 'Citizen', 'Housing with Mortgage',
'Median Monthly housing costs', 'High School Graduate',
'Some College or Associates degree', 'Bachelors degree or higher',
'ZipCode', 'City', 'State'],
dtype='object')
# To drop the NAME, Poverty and ChildPoverty columns as we don't have data in them
df_merged.drop(['NAME', 'Poverty', 'ChildPoverty'], axis=1, inplace=True)
# To set the index of the data frame.
df_merged.set_index(['ZipCode'], inplace=True)
# To check number of rows and columns
df_merged.shape
(32921, 39)
# To filter out the rows that have non-numeric values in the Income column
df_final = df_merged[df_merged['Income'].apply(lambda x: x.isnumeric())]
# To check the number of rows after dropping the rows with non-numeric Income values
df_final.shape
(30474, 39)
# To verify the name of columns
df_final.columns
Index(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',
'Professional', 'Service', 'Office', 'Construction', 'Production',
'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed',
'FamilyWork', 'Unemployment', 'TotalPop', 'Men', 'Women', 'Hispanic',
'White', 'Black', 'Native', 'Asian', 'Pacific', 'Citizen',
'Housing with Mortgage', 'Median Monthly housing costs',
'High School Graduate', 'Some College or Associates degree',
'Bachelors degree or higher', 'City', 'State'],
dtype='object')
# To drop the MeanCommute column
del df_final['MeanCommute']
As the counts in the Asian, Pacific and Native columns are very small, we add them together into a new column 'RaceOther' and then drop the original three columns, reducing the number of columns to analyze.
convert_dict = {'Native': int,'Pacific': int,'Asian': int}
df_final = df_final.astype(convert_dict)
# To add the values of the Asian, Pacific and Native columns into the new column 'RaceOther'
df_final['RaceOther']= df_final['Asian'] + df_final['Pacific']+ df_final['Native']
df_final=df_final.drop(['Asian', 'Pacific','Native'], axis=1)
df_final['RaceOther']
ZipCode
00601 104
00602 4757
00603 388
00606 22
00610 2936
...
99921 572
99922 328
99925 505
99926 1410
99929 847
Name: RaceOther, Length: 30474, dtype: int32
# To verify the updated column name
df_final.columns
Index(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',
'Professional', 'Service', 'Office', 'Construction', 'Production',
'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork',
'Unemployment', 'TotalPop', 'Men', 'Women', 'Hispanic', 'White',
'Black', 'Citizen', 'Housing with Mortgage',
'Median Monthly housing costs', 'High School Graduate',
'Some College or Associates degree', 'Bachelors degree or higher',
'City', 'State', 'RaceOther'],
dtype='object')
This is done for the Income column first and then expanded to the other numeric columns as well.
# To change the data type of Income
df_final.Income = df_final.Income.astype(int)
# Column index 29 ('Median Monthly housing costs') also contains non-numeric values; collect those rows for dropping
to_be_dropped = df_final[~df_final.iloc[:, 29].str.isnumeric()].index
df_final.drop(to_be_dropped, inplace=True)
convert_dict = {'Income':int,
'Professional':int, 'Service':int, 'Office':int, 'Construction':int, 'Production':int,
'Drive':int, 'Carpool':int, 'Transit':int, 'Walk':int, 'OtherTransp':int, 'WorkAtHome':int,
'Employed':int, 'PrivateWork':int, 'PublicWork':int, 'SelfEmployed':int, 'FamilyWork':int ,'Unemployment':int, 'TotalPop':int, 'Men':int, 'Women':int, 'Hispanic':int, 'White':int,
'Black':int, 'Citizen':int, 'Housing with Mortgage':int,
'Median Monthly housing costs':int, 'High School Graduate':int,
'Some College or Associates degree':int, 'Bachelors degree or higher':int
}
df_final = df_final.astype(convert_dict)
# To convert education related columns to percentage values
df_final.iloc[:,30:33] = df_final.iloc[:,30:33].div(df_final['TotalPop'], axis=0)
df_final.iloc[:,30:33] =df_final.iloc[:,30:33].apply(lambda x: x*100)
Our last step in managing our data is to combine the multiple sub-factors of a particular category into a single factor using weighted averages. Each state's data will then be compared with the national average to determine which factors fall short and need improvement.
# To verify the updated column name
df_final.columns
Index(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',
'Professional', 'Service', 'Office', 'Construction', 'Production',
'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork',
'Unemployment', 'TotalPop', 'Men', 'Women', 'Hispanic', 'White',
'Black', 'Citizen', 'Housing with Mortgage',
'Median Monthly housing costs', 'High School Graduate',
'Some College or Associates degree', 'Bachelors degree or higher',
'City', 'State', 'RaceOther'],
dtype='object')
# To calculate the CNI we need to compare the population of a state with the national average for different factors and see how much potential improvement is possible.
# In order to ensure we are accurate with our analysis, we need to treat the factors as they are featured in our day-to-day lives, with an emphasis on which factors can be improved.
# We start by assigning weights to a dataframe for use in our calculations later.
# Five equal weights (summing to 1) for categories with five sub-factors
data_5_subfactors = [.2, .2, .2, .2, .2]
weighted_averages_5_subfactors = pd.DataFrame(data_5_subfactors, columns=['Weights'])
# Three equal weights (summing to 1) for categories with three sub-factors
data_3_subfactors = [1/3, 1/3, 1/3]
weighted_averages_3_subfactors = pd.DataFrame(data_3_subfactors, columns=['Weights'])
# Create new pandas DataFrame.
cni_df = pd.DataFrame(df_final[['Professional','Service','Office','Construction','Production','Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome','Employed','PrivateWork','PublicWork','SelfEmployed','FamilyWork','Unemployment','TotalPop','Median Monthly housing costs','High School Graduate','Some College or Associates degree','Bachelors degree or higher','City','State']])
# Appending a new row to the data frame to hold the National average (pd.concat replaces the deprecated DataFrame.append)
new_row = {'State':'National'}
cni_df = pd.concat([cni_df, pd.DataFrame([new_row], index=['00000'], columns=cni_df.columns)])
cni_df.sort_index().head(10)
# Clean up non-numeric values in 'Median Monthly housing costs' before converting to float
cni_df['Median Monthly housing costs'] = (
    cni_df['Median Monthly housing costs']
    .replace({'-': '0', '4,000+': '4000', '': '0'})
    .astype(float)
)
cni_df = cni_df.replace(np.nan, 0)
#Calculating the aggregate functions for each state
#cni_df_mean = cni_df.groupby(['State']).mean()
cni_df_sum = cni_df.groupby(['State']).sum()
cni_df_sum.loc['National'] = cni_df_sum.mean()
household_costs = cni_df_sum.at['National','Median Monthly housing costs']
cni_df_sum['Median Monthly housing costs'] = ((cni_df_sum['Median Monthly housing costs'] / household_costs) *100) - 100
#for loop to convert to percentages based on population
for column in cni_df_sum[['Professional','Service','Office','Construction','Production','Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome','Employed','PrivateWork','PublicWork','SelfEmployed','FamilyWork','Unemployment','High School Graduate','Some College or Associates degree','Bachelors degree or higher']]:
# Select column contents by column
cni_df_sum[column] = ((cni_df_sum[column] / (cni_df_sum['TotalPop'])) *100)
#print(cni_df_sum.at['National',column])
cni_df_sum
| Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | ... | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | TotalPop | Median Monthly housing costs | High School Graduate | Some College or Associates degree | Bachelors degree or higher | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State | |||||||||||||||||||||
| Alabama | 15.517892 | 6.996747 | 9.361843 | 2.905560 | 7.460299 | 36.431098 | 3.587654 | 0.142825 | 0.438670 | 0.390554 | ... | 34.217313 | 6.819153 | 2.352346 | 0.068796 | 2.554863 | 4.854899e+06 | -18.165793 | 0.282468 | 0.072049 | 0.159863 |
| Alaska | 18.164113 | 8.300886 | 9.742261 | 3.248667 | 5.729972 | 32.782396 | 5.712424 | 0.591883 | 3.078941 | 2.340202 | ... | 32.065664 | 11.956176 | 3.210694 | 0.117657 | 3.615016 | 6.952390e+05 | -74.700786 | 0.531333 | 0.089519 | 0.260296 |
| Arizona | 16.893299 | 8.384180 | 10.764726 | 3.186526 | 4.915529 | 33.121350 | 4.746640 | 0.708042 | 0.775876 | 1.102445 | ... | 35.938092 | 6.170506 | 2.762978 | 0.082710 | 2.772566 | 7.092238e+06 | -41.171357 | 0.088568 | 0.029628 | 0.089994 |
| Arkansas | 15.233561 | 7.228191 | 9.258655 | 3.012652 | 7.508308 | 35.491897 | 4.402944 | 0.131575 | 0.685975 | 0.491731 | ... | 34.280291 | 6.625138 | 2.609711 | 0.088890 | 2.379219 | 2.982323e+06 | -38.990689 | 0.447795 | 0.081341 | 0.187184 |
| California | 18.052831 | 8.866379 | 10.003480 | 3.129476 | 5.886095 | 33.723382 | 4.729881 | 1.953256 | 0.996406 | 1.054699 | ... | 36.717911 | 6.733633 | 3.636072 | 0.089660 | 3.169723 | 3.673532e+07 | 310.408103 | 0.059096 | 0.023061 | 0.083370 |
| Colorado | 22.574445 | 8.550391 | 10.853854 | 4.182373 | 5.205295 | 37.520998 | 4.391058 | 1.453044 | 1.321423 | 1.107471 | ... | 41.502135 | 7.008471 | 3.243752 | 0.123338 | 2.510374 | 5.638163e+06 | -6.453467 | 0.138559 | 0.051702 | 0.198719 |
| Connecticut | 22.170371 | 8.985366 | 10.587594 | 3.173954 | 5.378130 | 38.758584 | 4.008446 | 1.830711 | 1.294363 | 0.639119 | ... | 41.023023 | 6.665408 | 3.116804 | 0.086822 | 3.301980 | 3.373552e+06 | -29.940409 | 0.145793 | 0.046034 | 0.214605 |
| Delaware | 19.365242 | 8.101976 | 10.224503 | 3.267541 | 5.522569 | 37.000775 | 3.975466 | 1.029805 | 0.939908 | 0.532402 | ... | 38.011725 | 7.202903 | 2.049192 | 0.090522 | 2.932537 | 9.599880e+05 | -88.392837 | 0.132495 | 0.037060 | 0.134530 |
| District of Columbia | 35.784307 | 7.463003 | 7.985717 | 1.566373 | 2.069634 | 17.397071 | 2.641422 | 17.303050 | 6.675175 | 3.687144 | ... | 38.768211 | 13.599548 | 2.336075 | 0.076042 | 4.248615 | 6.785740e+05 | -92.594017 | 0.028614 | 0.005780 | 0.149399 |
| Florida | 16.586525 | 8.972596 | 11.123186 | 3.603624 | 4.791410 | 35.094141 | 4.149938 | 0.724587 | 0.602812 | 1.037378 | ... | 37.601212 | 5.339769 | 2.702514 | 0.087979 | 2.625330 | 2.107323e+07 | 83.591540 | 0.094155 | 0.030548 | 0.091858 |
| Georgia | 18.040724 | 7.494084 | 10.323714 | 3.105414 | 6.883665 | 35.927394 | 4.293264 | 0.889119 | 0.616355 | 0.829424 | ... | 37.547414 | 6.674288 | 2.535472 | 0.088588 | 2.775890 | 1.042808e+07 | 9.697993 | 0.144374 | 0.035129 | 0.099810 |
| Hawaii | 17.037514 | 10.949130 | 10.900893 | 3.466905 | 4.426400 | 33.225414 | 6.601400 | 2.699580 | 2.107894 | 1.494572 | ... | 34.601522 | 9.696170 | 3.277787 | 0.105341 | 2.273674 | 1.409701e+06 | -76.966288 | 0.113378 | 0.042759 | 0.123871 |
| Idaho | 16.858347 | 7.894713 | 10.170935 | 3.821008 | 6.130272 | 35.884644 | 4.305448 | 0.292759 | 1.113590 | 0.903777 | ... | 36.414500 | 6.804723 | 3.396893 | 0.113224 | 2.041543 | 1.737264e+06 | -63.418886 | 0.273096 | 0.089068 | 0.203593 |
| Illinois | 19.489175 | 8.265461 | 10.550409 | 2.670306 | 7.264796 | 34.573609 | 3.762452 | 4.236620 | 1.336477 | 0.870151 | ... | 40.813050 | 5.995118 | 2.244781 | 0.081697 | 3.120640 | 1.261241e+07 | 126.048495 | 0.245124 | 0.074099 | 0.174291 |
| Indiana | 16.895616 | 7.740172 | 9.917196 | 2.982455 | 9.407204 | 38.444218 | 4.254182 | 0.415381 | 0.926558 | 0.615683 | ... | 40.660537 | 5.217092 | 2.178391 | 0.076262 | 2.382582 | 6.659917e+06 | 0.314501 | 0.288240 | 0.068709 | 0.146369 |
| Iowa | 19.032455 | 8.108007 | 10.444292 | 3.357644 | 8.723400 | 40.429741 | 4.053605 | 0.448228 | 1.526999 | 0.659090 | ... | 40.958857 | 6.875755 | 3.269006 | 0.112591 | 2.075890 | 3.135234e+06 | 34.759644 | 0.731703 | 0.260454 | 0.407015 |
| Kansas | 19.587903 | 8.112023 | 10.204043 | 3.200810 | 7.264514 | 39.946177 | 4.403689 | 0.213104 | 1.034262 | 0.642028 | ... | 38.913726 | 7.839359 | 3.032888 | 0.114489 | 2.092403 | 2.872773e+06 | -3.531575 | 0.476105 | 0.149642 | 0.370616 |
| Kentucky | 15.875338 | 7.237387 | 9.545596 | 2.729507 | 8.285739 | 35.841981 | 4.031094 | 0.413792 | 0.859868 | 0.574805 | ... | 36.155268 | 6.410191 | 2.338713 | 0.105730 | 2.539380 | 4.382880e+06 | -18.023751 | 0.363542 | 0.080603 | 0.157939 |
| Louisiana | 15.409920 | 8.167556 | 9.453325 | 3.507870 | 5.694920 | 35.204949 | 3.886365 | 0.472184 | 0.764546 | 0.809719 | ... | 34.154816 | 6.615920 | 2.704819 | 0.077293 | 3.054733 | 4.604560e+06 | -33.921620 | 0.248471 | 0.040908 | 0.118844 |
| Maine | 19.580573 | 8.775628 | 10.601421 | 3.699747 | 6.062370 | 38.218018 | 4.576935 | 0.267515 | 1.785283 | 0.639980 | ... | 39.336539 | 6.786071 | 4.249180 | 0.086046 | 2.116638 | 1.333010e+06 | -35.127933 | 0.768307 | 0.223199 | 0.632789 |
| Maryland | 23.830384 | 8.591022 | 10.066531 | 3.694789 | 4.665454 | 36.673142 | 4.388598 | 3.748723 | 1.016108 | 0.804654 | ... | 37.485173 | 11.084008 | 2.555701 | 0.079659 | 2.815132 | 5.842390e+06 | 5.969199 | 0.143354 | 0.035596 | 0.175234 |
| Massachusetts | 25.029923 | 8.962845 | 10.274134 | 3.101142 | 4.957742 | 35.395434 | 3.785999 | 4.926456 | 2.359902 | 1.093837 | ... | 43.200730 | 6.552162 | 2.972921 | 0.082096 | 2.830026 | 6.658031e+06 | 40.586487 | 0.124896 | 0.043755 | 0.234664 |
| Michigan | 17.584876 | 8.061888 | 9.650896 | 2.578421 | 7.748206 | 36.949176 | 3.943358 | 0.580674 | 0.978408 | 0.574684 | ... | 39.363872 | 4.967974 | 2.317338 | 0.076014 | 2.986536 | 9.950858e+06 | 50.505568 | 0.229525 | 0.066974 | 0.162799 |
| Minnesota | 22.305615 | 8.316324 | 10.762776 | 3.227462 | 7.194304 | 39.406791 | 4.216674 | 1.669803 | 1.326165 | 0.791574 | ... | 43.486367 | 6.417065 | 2.826572 | 0.086691 | 2.089962 | 5.594551e+06 | 52.105558 | 0.341812 | 0.142519 | 0.263010 |
| Mississippi | 13.891386 | 7.423872 | 8.713395 | 2.803837 | 7.487517 | 35.093756 | 3.826217 | 0.110477 | 0.452029 | 0.638071 | ... | 31.722896 | 7.728780 | 2.288741 | 0.117371 | 3.204380 | 2.944501e+06 | -50.171857 | 0.276613 | 0.085255 | 0.146752 |
| Missouri | 18.190505 | 8.063147 | 10.518584 | 3.139059 | 7.038833 | 38.331103 | 3.958306 | 0.549856 | 0.804997 | 0.562050 | ... | 39.361456 | 5.887171 | 2.663779 | 0.094980 | 2.264538 | 6.077044e+06 | 31.858342 | 0.405351 | 0.082820 | 0.208100 |
| Montana | 18.407747 | 9.221034 | 10.308342 | 4.045053 | 5.439803 | 36.501247 | 4.467341 | 0.350868 | 2.177973 | 1.067944 | ... | 36.780408 | 8.015906 | 4.162393 | 0.146195 | 2.109333 | 1.043126e+06 | -50.787067 | 0.593173 | 0.183599 | 0.507790 |
| National | 18.585241 | 8.349249 | 10.202391 | 3.210139 | 6.304858 | 35.350445 | 4.172649 | 2.069473 | 1.122900 | 0.835093 | ... | 38.057873 | 6.686434 | 2.818555 | 0.087964 | 2.752580 | 6.091775e+06 | 0.000000 | 0.205704 | 0.059110 | 0.159334 |
| Nebraska | 20.212290 | 8.278421 | 11.034195 | 3.637961 | 7.392674 | 41.508335 | 4.619832 | 0.325807 | 1.262182 | 0.607112 | ... | 41.379541 | 7.327492 | 3.224092 | 0.111464 | 1.823797 | 1.910026e+06 | -18.632576 | 0.608735 | 0.250951 | 0.449183 |
| Nevada | 14.526139 | 11.827021 | 10.807067 | 3.402596 | 5.658025 | 35.582648 | 4.894715 | 1.359244 | 0.740563 | 1.046410 | ... | 38.791610 | 5.606405 | 2.564620 | 0.082182 | 3.312004 | 3.020135e+06 | -73.471929 | 0.095050 | 0.027278 | 0.079621 |
| New Hampshire | 22.601028 | 8.620209 | 11.509625 | 3.694819 | 6.471457 | 42.003871 | 4.044458 | 0.417261 | 1.338915 | 0.633947 | ... | 43.149527 | 7.081976 | 3.574282 | 0.087354 | 2.102636 | 1.353111e+06 | -43.567824 | 0.364111 | 0.136798 | 0.481784 |
| New Jersey | 21.805943 | 7.953427 | 10.840442 | 3.001327 | 5.830078 | 34.102104 | 3.866212 | 5.211197 | 1.295085 | 0.945817 | ... | 40.754602 | 6.874836 | 2.284959 | 0.069129 | 3.084758 | 8.607158e+06 | 70.532684 | 0.117520 | 0.030934 | 0.176998 |
| New Mexico | 16.235249 | 8.368484 | 9.316456 | 3.098929 | 4.249889 | 33.625294 | 4.260363 | 0.425353 | 0.779945 | 0.678647 | ... | 30.340390 | 9.580912 | 2.748369 | 0.089496 | 2.980223 | 2.033606e+06 | -69.039963 | 0.209063 | 0.060187 | 0.167816 |
| New York | 19.962686 | 9.726890 | 10.162657 | 2.818871 | 4.913470 | 25.160427 | 3.100604 | 12.129205 | 2.632111 | 0.993303 | ... | 37.713690 | 7.800679 | 2.693477 | 0.071108 | 2.960639 | 1.851681e+07 | 250.664789 | 0.179984 | 0.065100 | 0.182754 |
| North Carolina | 18.224936 | 7.834720 | 9.786308 | 3.274216 | 6.717170 | 36.939190 | 4.109554 | 0.446884 | 0.680022 | 0.613556 | ... | 37.612060 | 6.490079 | 2.597726 | 0.086597 | 2.687002 | 1.030290e+07 | 21.199993 | 0.154936 | 0.052326 | 0.126957 |
| North Dakota | 20.469291 | 9.051543 | 10.689266 | 3.992895 | 6.843893 | 42.979912 | 4.681194 | 0.278289 | 1.605304 | 0.708186 | ... | 40.870746 | 8.491968 | 3.866078 | 0.157057 | 1.682538 | 7.341290e+05 | -51.037009 | 0.915645 | 0.426466 | 0.678533 |
| Ohio | 18.098172 | 8.075664 | 10.154925 | 2.686925 | 8.131203 | 38.469280 | 3.583920 | 0.676802 | 0.991343 | 0.565742 | ... | 39.762701 | 5.871590 | 2.326092 | 0.075528 | 2.679132 | 1.162615e+07 | 71.469638 | 0.258710 | 0.061051 | 0.139719 |
| Oklahoma | 16.082777 | 7.796205 | 9.892935 | 3.234548 | 6.461184 | 36.595291 | 4.350927 | 0.162698 | 0.738358 | 0.616238 | ... | 34.502210 | 7.646599 | 2.958280 | 0.107904 | 2.423582 | 3.918291e+06 | -16.801149 | 0.389395 | 0.076473 | 0.205075 |
| Oregon | 19.506370 | 8.483489 | 10.004232 | 3.128172 | 5.982575 | 33.413197 | 4.448837 | 1.898612 | 1.673337 | 1.480632 | ... | 38.106307 | 6.652683 | 3.423823 | 0.097397 | 2.781505 | 4.163358e+06 | -25.800997 | 0.176386 | 0.060836 | 0.169809 |
| Pennsylvania | 19.395753 | 8.250551 | 10.120244 | 2.899997 | 6.907047 | 35.495893 | 3.907322 | 2.473363 | 1.628873 | 0.746123 | ... | 40.871552 | 5.192466 | 2.441574 | 0.075718 | 2.748137 | 1.270246e+07 | 173.502064 | 0.373152 | 0.083435 | 0.211696 |
| Puerto Rico | 10.369517 | 6.501285 | 8.184341 | 1.763543 | 3.614495 | 25.630334 | 2.360733 | 0.463876 | 0.811053 | 0.533410 | ... | 21.740750 | 6.610693 | 3.153418 | 0.048212 | 5.603530 | 3.250219e+06 | -86.179069 | 0.077504 | 0.028081 | 0.070680 |
| Rhode Island | 20.519943 | 9.340266 | 10.967426 | 3.082906 | 6.041975 | 39.545489 | 4.007028 | 1.114815 | 1.476163 | 0.645869 | ... | 41.551377 | 6.551567 | 2.476068 | 0.081007 | 2.965907 | 1.053000e+06 | -82.181309 | 0.126942 | 0.043169 | 0.171293 |
| South Carolina | 16.313769 | 7.919001 | 9.993936 | 3.143556 | 7.160914 | 37.114389 | 4.057505 | 0.230994 | 0.586460 | 0.646184 | ... | 36.136964 | 6.834258 | 2.495605 | 0.097330 | 2.659608 | 5.036494e+06 | -43.186005 | 0.173039 | 0.050290 | 0.114612 |
| South Dakota | 19.219038 | 8.610229 | 10.989418 | 3.689441 | 7.132849 | 40.722084 | 4.106377 | 0.232588 | 1.716101 | 0.586305 | ... | 39.594528 | 7.759845 | 4.027118 | 0.158990 | 1.764695 | 8.478520e+05 | -54.679404 | 0.828174 | 0.288678 | 0.531700 |
| Tennessee | 16.885217 | 7.681539 | 10.142055 | 3.080457 | 7.778924 | 37.629158 | 4.017868 | 0.274707 | 0.530062 | 0.545297 | ... | 37.122462 | 6.179316 | 3.156276 | 0.088943 | 2.609092 | 6.727897e+06 | -12.650009 | 0.230936 | 0.042470 | 0.125883 |
| Texas | 17.651444 | 8.024465 | 10.358714 | 4.099250 | 6.069652 | 36.622279 | 4.634543 | 0.585768 | 0.645721 | 0.728613 | ... | 37.547453 | 6.270239 | 3.193893 | 0.098382 | 2.641412 | 2.816679e+07 | 202.270095 | 0.111904 | 0.027794 | 0.091578 |
| Utah | 19.401307 | 7.297096 | 11.578655 | 3.537830 | 6.296299 | 35.862658 | 5.017803 | 1.072669 | 1.114570 | 0.729783 | ... | 39.322347 | 7.095843 | 2.361080 | 0.091575 | 1.821483 | 3.126408e+06 | -54.987856 | 0.132280 | 0.051307 | 0.133141 |
| Vermont | 22.810934 | 8.795629 | 10.101763 | 3.793831 | 5.764614 | 38.657872 | 4.414257 | 0.609410 | 2.532730 | 0.881069 | ... | 40.641452 | 7.690850 | 4.363877 | 0.113880 | 2.058258 | 6.173180e+05 | -51.324871 | 0.902031 | 0.236649 | 1.048757 |
| Virginia | 22.288205 | 8.153862 | 9.788953 | 3.260821 | 5.219895 | 37.370689 | 4.451882 | 1.979120 | 1.028529 | 0.866081 | ... | 37.101869 | 9.865304 | 2.383311 | 0.084018 | 2.400365 | 8.320818e+06 | 46.977457 | 0.199920 | 0.051990 | 0.182205 |
| Washington | 20.768858 | 8.088880 | 9.525391 | 3.418154 | 5.869852 | 33.722919 | 4.692491 | 2.890937 | 1.576568 | 0.979126 | ... | 38.313013 | 7.670986 | 2.816186 | 0.088425 | 2.525203 | 7.379130e+06 | 16.752389 | 0.129915 | 0.053226 | 0.152064 |
| West Virginia | 14.423560 | 7.868152 | 9.042487 | 2.694171 | 5.934481 | 33.570453 | 3.542311 | 0.352015 | 1.141360 | 0.488274 | ... | 31.764317 | 8.149857 | 1.763450 | 0.056353 | 2.847876 | 1.719527e+06 | -44.836427 | 0.832852 | 0.127327 | 0.281314 |
| Wisconsin | 19.148602 | 8.302859 | 10.415931 | 3.033333 | 9.154616 | 40.493168 | 3.889678 | 0.771039 | 1.453320 | 0.748353 | ... | 42.389704 | 6.294605 | 2.618012 | 0.104956 | 1.889446 | 5.796726e+06 | 31.932751 | 0.325192 | 0.107715 | 0.221534 |
| Wyoming | 17.690173 | 8.550883 | 9.357948 | 4.051109 | 6.342138 | 38.111351 | 4.807987 | 0.494241 | 1.616081 | 0.857429 | ... | 35.678096 | 10.291278 | 3.211059 | 0.191348 | 2.301315 | 5.638950e+05 | -80.412562 | 0.418795 | 0.142173 | 0.324341 |
53 rows × 22 columns
# Express each employment and education column as a percent deviation from
# the National row, so the National row itself becomes 0
for column in ['Professional', 'Service', 'Office', 'Construction', 'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome', 'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork', 'Unemployment', 'High School Graduate', 'Some College or Associates degree', 'Bachelors degree or higher']:
    cni_df_sum[column] = ((cni_df_sum[column] / cni_df_sum.at['National', column]) * 100) - 100
cni_df_sum.head()
| Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | ... | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | TotalPop | Median Monthly housing costs | High School Graduate | Some College or Associates degree | Bachelors degree or higher | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State | |||||||||||||||||||||
| Alabama | -16.504219 | -16.199088 | -8.238740 | -9.488029 | 18.326204 | 3.056970 | -14.019750 | -93.098495 | -60.934176 | -53.232258 | ... | -10.091368 | 1.984903 | -16.540740 | -21.790411 | -7.182996 | 4854899.0 | -18.165793 | 37.317488 | 21.891077 | 0.332283 |
| Alaska | -2.265927 | -0.579247 | -4.510018 | 1.200202 | -9.118147 | -7.264547 | 36.901604 | -71.399352 | 174.195393 | 180.232684 | ... | -15.744993 | 78.812447 | 13.912764 | 33.755889 | 31.331892 | 695239.0 | -74.700786 | 158.299280 | 51.445707 | 63.365159 |
| Arizona | -9.103686 | 0.418370 | 5.511798 | -0.735574 | -22.035855 | -6.305705 | 13.756019 | -65.786384 | -30.904257 | 32.014661 | ... | -5.569889 | -7.716035 | -1.971831 | -5.973019 | 0.726080 | 7092238.0 | -41.171357 | -56.944196 | -49.875740 | -43.518504 |
| Arkansas | -18.034095 | -13.427055 | -9.250145 | -6.151986 | 19.087660 | 0.400141 | 5.519135 | -93.642089 | -38.910404 | -41.116617 | ... | -9.925889 | -0.916729 | -7.409642 | 1.052901 | -13.564044 | 2982323.0 | -38.990689 | 117.688837 | 37.610423 | 17.478773 |
| California | -2.864696 | 6.193723 | -1.949648 | -2.512768 | -6.641919 | -4.602667 | 13.354371 | -5.615787 | -11.264967 | 26.297216 | ... | -3.520854 | 0.705894 | 29.004818 | 1.928090 | 15.154599 | 36735325.0 | 310.408103 | -71.271531 | -60.986107 | -47.675711 |
5 rows × 22 columns
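To make the loop's effect concrete, here is a minimal sketch of the same percent-deviation-from-National transform on made-up numbers (the toy frame and its values are ours, not census data):

```python
import pandas as pd

# One column, three rows; the 'National' row is the baseline
toy = pd.DataFrame({"Employed": [40.0, 50.0, 60.0]},
                   index=["StateA", "National", "StateB"])

# Each value becomes ((value / national value) * 100) - 100,
# so 'National' maps to exactly 0
toy["Employed"] = (toy["Employed"] / toy.at["National", "Employed"]) * 100 - 100

print(toy)
# StateA ends up 20% below the national figure (-20.0), StateB 20% above (20.0)
```

The right-hand side is evaluated in full before assignment, so the baseline value is read before the column is overwritten.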
final_cni = cni_df_sum.copy()
final_cni = final_cni.drop(['Unemployment','TotalPop','Professional','Service','Office','Construction','Production','Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome','Employed','PrivateWork','PublicWork','SelfEmployed','FamilyWork','High School Graduate','Some College or Associates degree','Bachelors degree or higher'], axis=1)
final_cni
| Median Monthly housing costs | |
|---|---|
| State | |
| Alabama | -18.165793 |
| Alaska | -74.700786 |
| Arizona | -41.171357 |
| Arkansas | -38.990689 |
| California | 310.408103 |
| Colorado | -6.453467 |
| Connecticut | -29.940409 |
| Delaware | -88.392837 |
| District of Columbia | -92.594017 |
| Florida | 83.591540 |
| Georgia | 9.697993 |
| Hawaii | -76.966288 |
| Idaho | -63.418886 |
| Illinois | 126.048495 |
| Indiana | 0.314501 |
| Iowa | 34.759644 |
| Kansas | -3.531575 |
| Kentucky | -18.023751 |
| Louisiana | -33.921620 |
| Maine | -35.127933 |
| Maryland | 5.969199 |
| Massachusetts | 40.586487 |
| Michigan | 50.505568 |
| Minnesota | 52.105558 |
| Mississippi | -50.171857 |
| Missouri | 31.858342 |
| Montana | -50.787067 |
| National | 0.000000 |
| Nebraska | -18.632576 |
| Nevada | -73.471929 |
| New Hampshire | -43.567824 |
| New Jersey | 70.532684 |
| New Mexico | -69.039963 |
| New York | 250.664789 |
| North Carolina | 21.199993 |
| North Dakota | -51.037009 |
| Ohio | 71.469638 |
| Oklahoma | -16.801149 |
| Oregon | -25.800997 |
| Pennsylvania | 173.502064 |
| Puerto Rico | -86.179069 |
| Rhode Island | -82.181309 |
| South Carolina | -43.186005 |
| South Dakota | -54.679404 |
| Tennessee | -12.650009 |
| Texas | 202.270095 |
| Utah | -54.987856 |
| Vermont | -51.324871 |
| Virginia | 46.977457 |
| Washington | 16.752389 |
| West Virginia | -44.836427 |
| Wisconsin | 31.932751 |
| Wyoming | -80.412562 |
# Weighted subfactor combinations; multiplying a Series by a scalar weight
# is equivalent to (and faster than) applying a lambda element-wise
w5 = weighted_averages_5_subfactors["Weights"]
w3 = weighted_averages_3_subfactors["Weights"]
final_cni['Type of Work'] = (cni_df_sum['Professional'] * w5.iloc[0]
                             + cni_df_sum['Service'] * w5.iloc[1]
                             + cni_df_sum['Office'] * w5.iloc[2]
                             + cni_df_sum['Construction'] * w5.iloc[3]
                             + cni_df_sum['Production'] * w5.iloc[4])
final_cni['Mode of Transport'] = (cni_df_sum['Drive'] * w5.iloc[0]
                                  + cni_df_sum['Carpool'] * w5.iloc[1]
                                  + cni_df_sum['Transit'] * w5.iloc[2]
                                  + cni_df_sum['Walk'] * w5.iloc[3]
                                  + cni_df_sum['OtherTransp'] * w5.iloc[4])
final_cni['Employment Category'] = (cni_df_sum['WorkAtHome'] * w5.iloc[0]
                                    + cni_df_sum['PrivateWork'] * w5.iloc[1]
                                    + cni_df_sum['PublicWork'] * w5.iloc[2]
                                    + cni_df_sum['SelfEmployed'] * w5.iloc[3]
                                    + cni_df_sum['FamilyWork'] * w5.iloc[4])
final_cni['Educational Attainment'] = (cni_df_sum['High School Graduate'] * w3.iloc[0]
                                       + cni_df_sum['Some College or Associates degree'] * w3.iloc[1]
                                       + cni_df_sum['Bachelors degree or higher'] * w3.iloc[2])
final_cni
| Median Monthly housing costs | Type of Work | Mode of Transport | Employment Category | Educational Attainment | |
|---|---|---|---|---|---|
| State | |||||
| Alabama | -18.165793 | -6.420775 | -43.645542 | -17.924669 | 19.648480 |
| Alaska | -74.700786 | -3.054627 | 62.533157 | 19.692507 | 90.126348 |
| Arizona | -41.171357 | -5.188989 | -11.445133 | -0.601713 | -49.611685 |
| Arkansas | -38.990689 | -5.555124 | -33.549967 | -12.125266 | 57.016751 |
| California | 310.408103 | -1.555062 | 3.633633 | 7.297230 | -59.378005 |
| Colorado | -6.453467 | 8.621056 | 6.376720 | 27.242655 | -6.749935 |
| Connecticut | -29.940409 | 2.971766 | -2.805861 | 4.603793 | -5.463660 |
| Delaware | -88.392837 | -1.833516 | -20.567638 | -3.219296 | -29.191989 |
| District of Columbia | -92.594017 | -11.635916 | 296.921739 | 34.580900 | -60.240443 |
| Florida | 83.591540 | -1.202009 | -17.669933 | -4.184747 | -47.815837 |
| Georgia | 9.697993 | -1.213019 | -19.660602 | -0.732379 | -35.555028 |
| Hawaii | -76.966288 | 1.572523 | 49.866456 | 11.657827 | -31.284312 |
| Idaho | -63.418886 | 0.243237 | -14.752802 | 10.653743 | 36.703288 |
| Illinois | 126.048495 | 1.136037 | 23.181959 | -5.423844 | 17.789840 |
| Indiana | 0.314501 | 4.586277 | -22.596285 | -14.760262 | 15.915022 |
| Iowa | 34.759644 | 8.968575 | -10.382867 | 10.230992 | 248.088518 |
| Kansas | -3.531575 | 3.500025 | -20.435533 | 10.243131 | 137.680629 |
| Kentucky | -18.023751 | -3.577902 | -27.320012 | -7.007586 | 37.031706 |
| Louisiana | -33.921620 | -5.400558 | -23.881518 | -14.558129 | -11.686857 |
| Maine | -35.127933 | 5.155863 | -6.729627 | 14.668224 | 279.922128 |
| Maryland | 5.969199 | 3.776286 | 15.381083 | 13.039190 | -19.836673 |
| Massachusetts | 40.586487 | 3.593387 | 54.011956 | 7.319763 | -5.934301 |
| Michigan | 50.505568 | -2.203218 | -23.392908 | -14.332338 | 8.929581 |
| Minnesota | 52.105558 | 7.952634 | 1.221527 | 8.493854 | 89.874171 |
| Mississippi | -50.171857 | -8.966541 | -37.405498 | -9.807020 | 23.366263 |
| Missouri | 31.858342 | 1.395160 | -26.228471 | -2.821347 | 55.365482 |
| Montana | -50.787067 | 4.562630 | 9.823041 | 28.973448 | 203.829237 |
| National | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Nebraska | -18.632576 | 9.328014 | -14.203254 | 10.344619 | 231.789416 |
| Nevada | -73.471929 | 4.295213 | -5.020433 | -9.946994 | -52.032452 |
| New Hampshire | -43.567824 | 11.081308 | -13.787547 | 15.407495 | 135.567691 |
| New Jersey | 70.532684 | 0.961468 | 33.906069 | -5.216583 | -26.218678 |
| New Mexico | -69.039963 | -11.431078 | -26.300056 | 0.590440 | 2.897117 |
| New York | 250.664789 | -2.146961 | 116.986238 | -3.010155 | 4.068501 |
| North Carolina | 21.199993 | -0.728773 | -28.278544 | -1.220524 | -18.637267 |
| North Dakota | -51.037009 | 11.250868 | -5.003790 | 26.368925 | 426.514081 |
| Ohio | 71.469638 | 1.261147 | -23.310493 | -10.725887 | 5.525082 |
| Oklahoma | -16.801149 | -3.976402 | -28.959381 | -0.046179 | 48.635748 |
| Oregon | -25.800997 | -0.608657 | 23.840648 | 13.139337 | -1.570167 |
| Pennsylvania | 173.502064 | 0.452725 | 9.594976 | -8.066395 | 51.287830 |
| Puerto Rico | -86.179069 | -34.770746 | -42.480419 | -29.781819 | -56.250793 |
| Rhode Island | -82.181309 | 4.329001 | -5.886371 | -6.406690 | -19.057829 |
| South Carolina | -43.186005 | -1.582936 | -31.400353 | -5.427653 | -19.426873 |
| South Dakota | -54.679404 | 8.462723 | -10.423583 | 28.973767 | 305.145643 |
| Tennessee | -12.650009 | 0.320855 | -34.297338 | -2.135497 | -12.169837 |
| Texas | 202.270095 | 3.316845 | -22.454659 | 2.966915 | -46.564196 |
| Utah | -54.987856 | 3.070208 | -7.963179 | 5.845127 | -21.559777 |
| Vermont | -51.324871 | 7.342181 | 15.130419 | 29.344979 | 395.036111 |
| Virginia | 46.977457 | -0.419572 | 0.669468 | 8.924204 | -0.165590 |
| Washington | 16.752389 | 0.315083 | 21.039538 | 9.262179 | -16.949007 |
| West Virginia | -44.836427 | -12.294210 | -28.603684 | -22.259071 | 163.958084 |
| Wisconsin | 31.932751 | 8.852062 | -7.187440 | 3.254867 | 59.187157 |
| Wyoming | -80.412562 | 3.222131 | -1.297245 | 33.186922 | 114.732941 |
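The four column-by-column weighted sums used to build these factors are all instances of the same operation, which can be factored into one helper. This is a sketch only; the helper name and the demo values are ours, not from the notebook:

```python
import pandas as pd

def weighted_index(df: pd.DataFrame, columns: list, weights: list) -> pd.Series:
    """Row-wise weighted sum of the given columns (a matrix-vector product)."""
    return df[columns].mul(weights, axis=1).sum(axis=1)

# Made-up subfactor values and weights for illustration
demo = pd.DataFrame({"Professional": [10.0, 20.0], "Service": [4.0, 2.0]})
out = weighted_index(demo, ["Professional", "Service"], [0.5, 0.25])

print(out.tolist())
# row 0: 10*0.5 + 4*0.25 = 6.0; row 1: 20*0.5 + 2*0.25 = 10.5
```

With such a helper, each factor column reduces to one call with its column list and weight vector, which also makes it harder for a copy-pasted weight index to drift out of sync.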
This section will attempt to answer the questions of interest that will assist in building the needs index. We will perform exploratory data analysis to study the relationships between education, type of work, and income.
The public sector, also known as the state sector, is a sector of the economy that includes both public services and public enterprises. It excludes private businesses, non-profit organizations, and households. Public sectors include the military, law enforcement, infrastructure, public transportation, public education, and health care, as well as those working for the government itself, such as elected officials.
The private sector is the segment of the economy owned, managed, and controlled by profit-seeking individuals and organizations. Companies in the private sector are typically not owned or controlled by the state.
The private sector is typically the largest, employing roughly ten times as many people as the public sector.
The self-employed proportion of the population is typically covered by the private sector, but to distinguish freelancers from organizationally employed workers, we treat the self-employed count as a separate entity.
It is widely assumed that public jobs raise the median household income of the entire state. We've come to find out if that's correct.
print('Median household income in the US is: $', round(df_final['Income'].mean(), 2))
Median household income in the US is: $ 62995.77
income_state= df_final.groupby(['State'],as_index=False)['Income'].mean().sort_values("Income", ascending = False).reset_index()
del income_state['index']
income_state.index = np.arange(1, len(income_state) + 1)
income_state.head(5)
| State | Income | |
|---|---|---|
| 1 | New Jersey | 96061.160305 |
| 2 | District of Columbia | 95414.714286 |
| 3 | Maryland | 92802.761194 |
| 4 | Massachusetts | 90983.526860 |
| 5 | Connecticut | 89040.648221 |
# Creating one dictionary to replace the State name with the State code
us_state= {
"Alabama": "AL",
"Alaska": "AK",
"Arizona": "AZ",
"Arkansas": "AR",
"California": "CA",
"Colorado": "CO",
"Connecticut": "CT",
"Delaware": "DE",
"Florida": "FL",
"Georgia": "GA",
"Hawaii": "HI",
"Idaho": "ID",
"Illinois": "IL",
"Indiana": "IN",
"Iowa": "IA",
"Kansas": "KS",
"Kentucky": "KY",
"Louisiana": "LA",
"Maine": "ME",
"Maryland": "MD",
"Massachusetts": "MA",
"Michigan": "MI",
"Minnesota": "MN",
"Mississippi": "MS",
"Missouri": "MO",
"Montana": "MT",
"Nebraska": "NE",
"Nevada": "NV",
"New Hampshire": "NH",
"New Jersey": "NJ",
"New Mexico": "NM",
"New York": "NY",
"North Carolina": "NC",
"North Dakota": "ND",
"Ohio": "OH",
"Oklahoma": "OK",
"Oregon": "OR",
"Pennsylvania": "PA",
"Rhode Island": "RI",
"South Carolina": "SC",
"South Dakota": "SD",
"Tennessee": "TN",
"Texas": "TX",
"Utah": "UT",
"Vermont": "VT",
"Virginia": "VA",
"Washington": "WA",
"West Virginia": "WV",
"Wisconsin": "WI",
"Wyoming": "WY",
"District of Columbia": "DC",
"American Samoa": "AS",
"Guam": "GU",
"Northern Mariana Islands": "MP",
"Puerto Rico": "PR",
"United States Minor Outlying Islands": "UM",
"U.S. Virgin Islands": "VI",
}
# To replace the state with the state code
income_state['State'] = income_state['State'].replace(us_state)
# Display the State codes as strings (astype returns a new Series; income_state itself is unchanged)
income_state.State.astype(str)
1 NJ 2 DC 3 MD 4 MA 5 CT 6 RI 7 NH 8 HI 9 CA 10 NY 11 DE 12 WA 13 CO 14 VA 15 UT 16 ND 17 MN 18 IL 19 WY 20 VT 21 NV 22 WI 23 PA 24 IA 25 NE 26 OR 27 SD 28 TX 29 AK 30 IN 31 FL 32 OH 33 MI 34 KS 35 ME 36 AZ 37 MT 38 ID 39 NC 40 GA 41 MO 42 TN 43 OK 44 SC 45 AL 46 LA 47 KY 48 WV 49 NM 50 AR 51 MS 52 PR Name: State, dtype: object
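As a side note on the dict-based recoding above: `Series.replace` leaves unmatched labels (such as the 'National' row) untouched, whereas `Series.map` would turn them into NaN. A toy sketch with a two-entry dictionary of our own, not the full mapping:

```python
import pandas as pd

# Hypothetical two-entry subset of the us_state dictionary
us_state = {"New Jersey": "NJ", "Maryland": "MD"}
s = pd.Series(["New Jersey", "Maryland", "National"])

print(s.replace(us_state).tolist())  # unmatched labels are kept as-is
print(s.map(us_state).tolist())      # unmatched labels become NaN
```

`replace` is therefore the safer choice here, since the data contains a 'National' row with no state code.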
df_work2 = df_final[['PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork', 'Income', 'Unemployment', 'State', 'TotalPop']].copy()
df_work2['TotalWorkPop'] = df_work2['PrivateWork'] + df_work2['PublicWork'] + df_work2['SelfEmployed'] + df_work2['FamilyWork']
df_state = df_work2.groupby(['State'], as_index=False)['TotalPop'].sum().sort_values("TotalPop", ascending=False)
df_work = df_work2.groupby(['State'], as_index=False)[['PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork', 'Unemployment', 'TotalWorkPop']].sum().sort_values("PublicWork", ascending=False)
df_income= df_work2.groupby(['State'],as_index=False)['Income'].mean().sort_values("Income", ascending = False)
df_map = df_state.merge(df_work,on='State').merge(df_income,on='State')
# Convert the work-related columns to percentages of the total working population
df_map.iloc[:, 2:7] = df_map.iloc[:, 2:7].div(df_map['TotalWorkPop'], axis=0) * 100
df_map
| State | TotalPop | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | TotalWorkPop | Income | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 36735325 | 77.829654 | 14.273043 | 7.707254 | 0.190050 | 6.718749 | 17330726 | 75406.715000 |
| 1 | Texas | 28166789 | 79.701719 | 13.309793 | 6.779654 | 0.208834 | 5.606907 | 13269365 | 61707.793103 |
| 2 | Florida | 21073233 | 82.221737 | 11.676355 | 5.909527 | 0.192382 | 5.740752 | 9637100 | 60075.754135 |
| 3 | New York | 18516812 | 78.116212 | 16.157514 | 5.578987 | 0.147286 | 6.132360 | 8939723 | 72053.825133 |
| 4 | Pennsylvania | 12702461 | 84.130198 | 10.688197 | 5.025748 | 0.155857 | 5.656778 | 6171022 | 63147.635317 |
| 5 | Illinois | 12612412 | 83.063690 | 12.201407 | 4.568631 | 0.166272 | 6.351201 | 6197064 | 65904.558552 |
| 6 | Ohio | 11626152 | 82.777031 | 12.223335 | 4.842402 | 0.157232 | 5.577353 | 5584728 | 60064.601277 |
| 7 | Georgia | 10428079 | 80.151145 | 14.247367 | 5.412382 | 0.189105 | 5.925595 | 4885113 | 54812.826484 |
| 8 | North Carolina | 10302895 | 80.390904 | 13.871703 | 5.552303 | 0.185090 | 5.743119 | 4820360 | 54940.244265 |
| 9 | Michigan | 9950858 | 84.245492 | 10.632322 | 4.959504 | 0.162682 | 6.391704 | 4649558 | 59328.206568 |
| 10 | New Jersey | 8607158 | 81.536070 | 13.754204 | 4.571424 | 0.138303 | 6.171550 | 4302161 | 96061.160305 |
| 11 | Virginia | 8320818 | 75.052579 | 19.956313 | 4.821150 | 0.169959 | 4.855647 | 4113355 | 69735.424324 |
| 12 | Washington | 7379130 | 78.367975 | 15.690742 | 5.760413 | 0.180870 | 5.165217 | 3607554 | 70725.940840 |
| 13 | Arizona | 7092238 | 79.943637 | 13.726180 | 6.146196 | 0.183987 | 6.167524 | 3188265 | 58111.956140 |
| 14 | Tennessee | 6727897 | 79.752647 | 13.275434 | 6.780837 | 0.191082 | 5.605285 | 3131634 | 52762.074913 |
| 15 | Indiana | 6659917 | 84.476645 | 10.839071 | 4.525841 | 0.158443 | 4.950071 | 3205570 | 60169.627809 |
| 16 | Massachusetts | 6658031 | 81.807309 | 12.407540 | 5.629689 | 0.155462 | 5.359095 | 3515967 | 90983.526860 |
| 17 | Missouri | 6077044 | 81.990416 | 12.263053 | 5.548685 | 0.197845 | 4.717063 | 2917430 | 54102.435897 |
| 18 | Maryland | 5842390 | 73.206735 | 21.646534 | 4.991160 | 0.155571 | 5.497817 | 2991569 | 92802.761194 |
| 19 | Wisconsin | 5796726 | 82.458567 | 12.244580 | 5.092688 | 0.204165 | 3.675444 | 2979939 | 63733.914324 |
| 20 | Colorado | 5638163 | 79.999959 | 13.509603 | 6.252690 | 0.237748 | 4.839025 | 2924949 | 70080.758542 |
| 21 | Minnesota | 5594551 | 82.334509 | 12.149691 | 5.351663 | 0.164137 | 3.957010 | 2954857 | 67039.682783 |
| 22 | South Carolina | 5036494 | 79.310068 | 14.999198 | 5.477123 | 0.213610 | 5.837062 | 2294836 | 50157.204360 |
| 23 | Alabama | 4854899 | 78.737221 | 15.691506 | 5.412966 | 0.158307 | 5.878977 | 2109823 | 49457.133690 |
| 24 | Louisiana | 4604560 | 78.421545 | 15.190556 | 6.210429 | 0.177469 | 7.013853 | 2005417 | 49176.983721 |
| 25 | Kentucky | 4382880 | 80.327364 | 14.241736 | 5.195997 | 0.234903 | 5.641826 | 1972730 | 48647.682968 |
| 26 | Oregon | 4163358 | 78.927385 | 13.779316 | 7.091566 | 0.201733 | 5.761169 | 2010078 | 62107.326923 |
| 27 | Oklahoma | 3918291 | 76.307012 | 16.911645 | 6.542696 | 0.238647 | 5.360129 | 1771655 | 51905.199662 |
| 28 | Connecticut | 3373552 | 80.607909 | 13.097148 | 6.124343 | 0.170601 | 6.488202 | 1716870 | 89040.648221 |
| 29 | Puerto Rico | 3250219 | 68.902163 | 20.951027 | 9.994013 | 0.152797 | 17.759063 | 1025544 | 20779.390244 |
| 30 | Iowa | 3135234 | 79.972449 | 13.424958 | 6.382757 | 0.219835 | 4.053189 | 1605748 | 62719.079955 |
| 31 | Utah | 3126408 | 80.461770 | 14.519583 | 4.831265 | 0.187381 | 3.727137 | 1527902 | 69376.398340 |
| 32 | Nevada | 3020135 | 82.456713 | 11.917158 | 5.451441 | 0.174688 | 7.040104 | 1420817 | 64141.641791 |
| 33 | Arkansas | 2982323 | 78.617255 | 15.193865 | 5.985022 | 0.203858 | 5.456420 | 1300413 | 46117.698745 |
| 34 | Mississippi | 2944501 | 75.787321 | 18.464377 | 5.467897 | 0.280405 | 7.655397 | 1232503 | 42658.184066 |
| 35 | Kansas | 2872773 | 77.982696 | 15.709994 | 6.077876 | 0.229434 | 4.193154 | 1433527 | 59156.324717 |
| 36 | New Mexico | 2033606 | 70.956457 | 22.406685 | 6.427555 | 0.209303 | 6.969788 | 869553 | 47735.769608 |
| 37 | Nebraska | 1910026 | 79.510919 | 14.079799 | 6.195103 | 0.214179 | 3.504432 | 994027 | 62234.540856 |
| 38 | Idaho | 1737264 | 77.926416 | 14.561992 | 7.269294 | 0.242297 | 4.368869 | 811812 | 56274.102222 |
| 39 | West Virginia | 1719527 | 76.111406 | 19.528111 | 4.225454 | 0.135028 | 6.823879 | 717627 | 47934.045872 |
| 40 | Hawaii | 1409701 | 72.569058 | 20.335576 | 6.874436 | 0.220931 | 4.768529 | 672157 | 76795.358974 |
| 41 | New Hampshire | 1353111 | 80.064972 | 13.140775 | 6.632165 | 0.162088 | 3.901491 | 729234 | 81356.851064 |
| 42 | Maine | 1333010 | 77.959227 | 13.448993 | 8.421250 | 0.170530 | 4.194865 | 672608 | 58251.154242 |
| 43 | Rhode Island | 1053000 | 82.020058 | 12.932421 | 4.887618 | 0.159903 | 5.854532 | 533450 | 81764.101449 |
| 44 | Montana | 1043126 | 74.901704 | 16.324044 | 8.476532 | 0.297720 | 4.295565 | 512226 | 57263.481481 |
| 45 | Delaware | 959988 | 80.270834 | 15.210649 | 4.327359 | 0.191159 | 6.192751 | 454596 | 71671.214286 |
| 46 | South Dakota | 847852 | 76.822194 | 15.055825 | 7.813505 | 0.308476 | 3.423900 | 436987 | 61790.978947 |
| 47 | North Dakota | 734129 | 76.557265 | 15.906777 | 7.241766 | 0.294192 | 3.151656 | 391921 | 68542.468852 |
| 48 | Alaska | 695239 | 67.720240 | 25.250534 | 6.780742 | 0.248483 | 7.634638 | 329197 | 61017.500000 |
| 49 | District of Columbia | 678574 | 70.770899 | 24.825811 | 4.264477 | 0.138813 | 7.755796 | 371722 | 95414.714286 |
| 50 | Vermont | 617318 | 76.957786 | 14.563229 | 8.263345 | 0.215640 | 3.897474 | 326006 | 65786.290456 |
| 51 | Wyoming | 563895 | 72.264148 | 20.844453 | 6.503834 | 0.387565 | 4.661195 | 278405 | 65843.103774 |
df_map = df_map.round(1)
# Creating new column of State Code based on State
df_map['State Code'] = df_map['State'].replace(us_state)
# To display the first five rows of the data frame
df_map.sort_values(by=['Income','PrivateWork'], ascending=False).iloc[:,[1,2,3,8,-1]].head()
| TotalPop | PrivateWork | PublicWork | Income | State Code | |
|---|---|---|---|---|---|
| 10 | 8607158 | 81.5 | 13.8 | 96061.2 | NJ |
| 49 | 678574 | 70.8 | 24.8 | 95414.7 | DC |
| 18 | 5842390 | 73.2 | 21.6 | 92802.8 | MD |
| 16 | 6658031 | 81.8 | 12.4 | 90983.5 | MA |
| 28 | 3373552 | 80.6 | 13.1 | 89040.6 | CT |
# To create a new column to display the hover text
df_map['text'] = 'State: ' + df_map['State'].astype(str) + '<br>' + 'Private Work: ' + df_map['PrivateWork'].astype(str)+ '%'+ '<br>' + 'Public Work: ' + df_map['PublicWork'].astype(str)+ '%'+ '<br>' + 'Self Employed: ' + df_map['SelfEmployed'].astype(str) + '%'+ '<br>' + 'Family Work: ' + df_map['FamilyWork'].astype(str) + '%'+ '<br>' + 'Unemployment: ' + df_map['Unemployment'].astype(str) +'%'
# To plot the map to show the Work Distribution
fig = go.Figure(data = go.Choropleth(
locations = df_map['State Code'],
z = df_map['Income'],
locationmode = "USA-states",
colorscale = "turbo",
colorbar_title = "Income Distribution (in $)",
text = df_map['text']))
fig.update_layout(
title={
'text': "Distribution of Median Household Income and type of work across US States",
'y':0.9,
'x':0.4,
'xanchor': 'center',
'yanchor': 'top'},
geo_scope = "usa"
)
fig.show()
The private sector is the segment of the economy owned, managed, and controlled by profit-seeking individuals and organizations; companies in the private sector are typically not owned or controlled by the state. Examples include sole proprietorships (plumbers, technicians, contractors, developers, designers), partnerships (legal, accounting, tax, and dentistry practices), and privately held corporations in hospitality, leisure, retail, and food.
As before, the self-employed are usually counted within the private sector, but to distinguish freelancers from organizationally employed workers, we treat the self-employed count as a separate entity.
It is widely assumed that private sector jobs raise the median household income of the entire state. We examine the three states with the highest incomes to find out if that's correct.
A map of the United States allows us to see the distribution of work across the states as well as their median earnings. Hovering over each state displays its statewide median income; its workforce proportions in the private, public, and self-employed sectors; the share engaged in family work; and the proportion of people who remain unemployed.
It is clear that states with more public sector workers have higher median household incomes than states with fewer public workers. However, states with significantly fewer private sector workers still seem to trail behind the majority. There could be several reasons for this. To start with, increasing public sector employment can be an effective short-term method of reducing unemployment and can provide a stabilizing effect. Employment in the public sector can also stimulate demand in other sectors of the economy (for example, private services) and helps implement equal and fair policies, such as encouraging marginalized and/or disadvantaged groups to work. The US map above shows that in states such as Maryland and Virginia, and even in the District of Columbia, public sector employment is higher than elsewhere, even though private sector employment is still nearly four times larger. A common inference would be that a state with more public sector workers does better and has a higher median household income, and the map bears that out.
df_corr_work = df_final[['Income', 'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork']]
df_corr_work.corr()
| Income | PrivateWork | PublicWork | SelfEmployed | FamilyWork | |
|---|---|---|---|---|---|
| Income | 1.000000 | 0.231417 | 0.215182 | 0.212151 | 0.129838 |
| PrivateWork | 0.231417 | 1.000000 | 0.865973 | 0.891819 | 0.546711 |
| PublicWork | 0.215182 | 0.865973 | 1.000000 | 0.788791 | 0.488799 |
| SelfEmployed | 0.212151 | 0.891819 | 0.788791 | 1.000000 | 0.536307 |
| FamilyWork | 0.129838 | 0.546711 | 0.488799 | 0.536307 | 1.000000 |
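The matrix above holds Pearson correlation coefficients. As a quick sanity check of what `.corr()` computes, here is a minimal sketch on toy data (made-up numbers, not the census values) showing that pandas agrees with NumPy's textbook formula:

```python
import numpy as np
import pandas as pd

# Toy frame: four fictional observations of two variables
df = pd.DataFrame({"Income": [50.0, 60.0, 70.0, 80.0],
                   "PublicWork": [10.0, 12.0, 11.0, 15.0]})

pandas_r = df["Income"].corr(df["PublicWork"])           # pairwise Pearson r
numpy_r = np.corrcoef(df["Income"], df["PublicWork"])[0, 1]

assert abs(pandas_r - numpy_r) < 1e-12
print(round(pandas_r, 4))  # ≈ 0.8367 for these toy values
```

Note that Pearson r only captures linear association; the modest Income coefficients in the matrix above (all below 0.24) suggest no single work category linearly explains income on its own.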
To understand why a particular state's income is high, we look at the education levels of the most educated and least educated states and compare their income levels.
df_edu = df_final[['High School Graduate', 'Some College or Associates degree', 'Bachelors degree or higher', 'TotalPop', 'Income', 'State']]
# Mean share of each education level per state, sorted from highest to lowest
edu1_state = df_final.groupby(['State'], as_index=False)['High School Graduate'].mean().sort_values("High School Graduate", ascending=False)
edu2_state = df_final.groupby(['State'], as_index=False)['Some College or Associates degree'].mean().sort_values("Some College or Associates degree", ascending=False)
edu3_state = df_final.groupby(['State'], as_index=False)['Bachelors degree or higher'].mean().sort_values("Bachelors degree or higher", ascending=False)
# Merge the three rankings into one table keyed by state
dup_edu = edu1_state.merge(edu2_state, on='State').merge(edu3_state, on='State')
dup_edu.head()
| | State | High School Graduate | Some College or Associates degree | Bachelors degree or higher |
|---|---|---|---|---|
| 0 | West Virginia | 32.846588 | 5.021604 | 11.094666 |
| 1 | Pennsylvania | 30.325949 | 6.780703 | 17.204514 |
| 2 | Arkansas | 27.938707 | 5.074999 | 11.678701 |
| 3 | Missouri | 27.461924 | 5.610955 | 14.098444 |
| 4 | Ohio | 27.443460 | 6.476209 | 14.821157 |
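The state-level means above weight every row equally, regardless of population. Since the rows of `df_final` are localities of very different sizes, a population-weighted mean may better reflect a state's attainment. A minimal sketch with synthetic stand-in data (the column names mirror `df_final`, but the values are invented):

```python
import pandas as pd

# Synthetic stand-in for df_final: two locality rows per state (invented values)
df = pd.DataFrame({
    'State': ['A', 'A', 'B', 'B'],
    'TotalPop': [1000, 9000, 5000, 5000],
    'Bachelors degree or higher': [10.0, 30.0, 20.0, 40.0],
})

# Population-weighted mean: multiply by population, sum per state,
# then divide by the state's total population
df['weighted'] = df['Bachelors degree or higher'] * df['TotalPop']
g = df.groupby('State')
weighted_mean = g['weighted'].sum() / g['TotalPop'].sum()
print(weighted_mean)  # A: 28.0 (vs. unweighted mean 20.0), B: 30.0
```

In state A the large locality dominates, so the weighted mean (28.0) differs sharply from the unweighted one (20.0); whether to weight is a modeling choice for this analysis.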
The second part of understanding why a particular state's income is high is to look at the education level within the state and to see whether minority races have received the same level of education. We will examine racially balanced states to see if their education levels vary and, if so, how that affects income.
fig = px.bar(dup_edu, x="State", y=['High School Graduate', 'Some College or Associates degree', 'Bachelors degree or higher'], title="Education Distribution across States", width=1100, height=600)
fig.update_layout(barmode='stack', xaxis={'categoryorder': 'total descending'})
fig.update_layout(
title={
'text': "Education Distribution across States",
'y':0.9,
'x':0.4,
'xanchor': 'center',
'yanchor': 'top'},
xaxis_title="State",
yaxis_title="Percentage of total population",
legend_title="Education Level")
fig.show()
The stacked bar chart is straightforward: it displays the educational levels of each state and the proportion of its population that is educated. The chart is arranged in descending order, beginning with the state with the most educated population and ending with the state with the least. The bars show how education levels vary by state, and income fluctuates largely in accordance with them, which suggests that education is strongly associated with income levels.
By creating a machine learning model that makes use of the above variables, we will learn how changes in those variables affect the median household income of a US state.
Further, the combination of the Community Needs Index and our model would give companies information on which geographical areas to work in and which factors to work on.
Analysing variables that have high correlation with income.
df_final.corr().iloc[:,0]*100
Income                               100.000000
Professional                          37.425576
Service                                7.523039
Office                                21.133021
Construction                          13.870081
Production                             1.433382
Drive                                 22.405340
Carpool                               11.804959
Transit                                8.755939
Walk                                   6.798403
OtherTransp                           10.939322
WorkAtHome                            38.836495
Employed                              18.149466
PrivateWork                           23.141678
PublicWork                            21.518222
SelfEmployed                          21.215105
FamilyWork                            12.983791
Unemployment                           3.597943
TotalPop                              18.033850
Men                                   18.286431
Women                                 17.743456
Hispanic                              -0.061469
White                                 23.210477
Black                                 -6.816756
Citizen                               18.715275
Housing with Mortgage                 32.528122
Median Monthly housing costs          74.232353
High School Graduate                 -46.421442
Some College or Associates degree      1.172333
Bachelors degree or higher            66.420682
RaceOther                             23.615976
Name: Income, dtype: float64
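To turn the correlation listing above into a shortlist of candidate predictors, one option is to screen by absolute correlation. A hedged sketch using a hand-typed subset of the values above (the threshold of 30 is an illustrative assumption, not a figure from the analysis):

```python
import pandas as pd

# Hand-typed subset of the income correlations above (values in percent)
corr = pd.Series({
    'Income': 100.0,
    'Median Monthly housing costs': 74.2,
    'Bachelors degree or higher': 66.4,
    'High School Graduate': -46.4,
    'Production': 1.4,
})

# Drop the trivial self-correlation, then keep predictors whose
# absolute correlation exceeds a chosen threshold
strong = corr.drop('Income')
strong = strong[strong.abs() > 30].sort_values(key=abs, ascending=False)
print(strong)
```

Note that a strongly negative correlation (such as High School Graduate) is just as informative for a linear model as a strongly positive one, which is why the screen uses the absolute value.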
# Importing libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score,mean_absolute_error,r2_score,mean_absolute_percentage_error
# Dropping irrelevant columns for training the model
x_features = df_final.drop(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',"City","State","RaceOther"], axis=1)
y_output = df_final['Income']
# Dividing the data into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x_features, y_output, shuffle=True)
# Building a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)
# Predicting on test data
y_pred = model.predict(x_test)
# Model performance evaluation
print("\tMean absolute error:", mean_absolute_error(y_test, y_pred))
print("\tR2 score:", r2_score(y_test, y_pred))
print(mean_absolute_percentage_error(y_test, y_pred))
plt.figure()
# KDE curves of actual vs. predicted income (sns.distplot is deprecated)
ax1 = sns.kdeplot(y_test, color="r", label="Actual Values")
sns.kdeplot(y_pred, color="b", label="Predicted Values", ax=ax1)
plt.title('Actual (Red) vs Predicted (Blue) Income')
plt.xlabel('Income')
plt.ylabel('Density')
plt.legend()
plt.show()
plt.close()
Mean absolute error: 9668.872091027031
R2 score: 0.6912387735814876
0.17908434488282726
Based on the performance evaluation parameters for the model, we obtain the following scores:
1. R2 (R-squared) value: 0.69
2. Mean absolute percentage error: 17.9%
These values suggest that our model predicts income with an error of about 18%.
We can leverage this model to predict what the median household income will be for a given state.
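The scores above come from a single train/test split, so they can shift from run to run. As a sanity check, k-fold cross-validation averages several held-out scores instead of relying on one split. A sketch using synthetic data, since `df_final` is not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the census features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validated R^2: averages five held-out scores instead of one split
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```

Applying the same call to `x_features` and `y_output` would show how stable the reported 0.69 R2 actually is.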
For our analysis, we decided to sample the CNI for Alabama, as the state's average median income lies in the range of $50,000 to $60,000, which shows the highest variance in the graph of our model above.
# Select Alabama and drop the columns the model was not trained on
df_AL = df_final[df_final['State']=='Alabama']
df_AL = df_AL.drop(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', "City", "State", "RaceOther"], axis=1)
df_AL
| ZipCode | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | ... | Women | Hispanic | White | Black | Citizen | Housing with Mortgage | Median Monthly housing costs | High School Graduate | Some College or Associates degree | Bachelors degree or higher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35004 | 2112 | 857 | 1357 | 503 | 843 | 5228 | 272 | 0 | 8 | 0 | ... | 6245 | 0 | 10142 | 1303 | 9254 | 2509 | 1122 | 27.670797 | 7.560281 | 14.057267 |
| 35005 | 1118 | 455 | 1352 | 174 | 707 | 3242 | 330 | 0 | 18 | 3 | ... | 4869 | 43 | 4299 | 4167 | 6872 | 1490 | 1105 | 33.783324 | 4.049915 | 8.893931 |
| 35006 | 132 | 216 | 609 | 66 | 135 | 1204 | 92 | 0 | 11 | 0 | ... | 1565 | 0 | 3061 | 103 | 2600 | 446 | 823 | 37.381703 | 3.091483 | 4.416404 |
| 35007 | 5601 | 2055 | 3360 | 811 | 1196 | 10720 | 1349 | 0 | 16 | 72 | ... | 13910 | 1514 | 19291 | 4680 | 18950 | 5632 | 1234 | 15.695964 | 6.325915 | 23.353452 |
| 35010 | 2127 | 1600 | 1851 | 590 | 2070 | 7031 | 1071 | 9 | 17 | 40 | ... | 10464 | 712 | 13162 | 6452 | 15492 | 2558 | 994 | 24.742676 | 5.289984 | 11.411322 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36915 | 96 | 0 | 19 | 6 | 33 | 158 | 79 | 0 | 0 | 0 | ... | 420 | 0 | 496 | 148 | 533 | 30 | 944 | 26.046512 | 3.565891 | 11.007752 |
| 36916 | 79 | 19 | 78 | 43 | 40 | 260 | 32 | 0 | 3 | 1 | ... | 502 | 0 | 487 | 543 | 770 | 50 | 1125 | 22.427184 | 15.048544 | 4.854369 |
| 36919 | 187 | 130 | 81 | 29 | 114 | 455 | 57 | 0 | 11 | 10 | ... | 881 | 0 | 869 | 836 | 1337 | 147 | 887 | 29.232572 | 7.498535 | 5.155243 |
| 36921 | 91 | 45 | 57 | 36 | 48 | 254 | 28 | 0 | 0 | 0 | ... | 447 | 0 | 768 | 46 | 783 | 131 | 1034 | 30.712531 | 11.670762 | 7.002457 |
| 36925 | 250 | 321 | 277 | 20 | 182 | 987 | 31 | 0 | 0 | 0 | ... | 1597 | 70 | 286 | 2545 | 2233 | 321 | 773 | 23.134583 | 5.144804 | 9.982964 |
561 rows × 29 columns
y_pred_al = model.predict(df_AL)
y_pred_al.mean()
52782.82952864726
income_state[income_state['State']=='AL']
| State | Income | |
|---|---|---|
| 45 | AL | 49457.13369 |
From the above analysis, we understand that the predicted income for Alabama is $52,783, whereas the actual median income is $49,457. This shows us that Alabama has room to improve its income.
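The size of that improvement follows directly from the predicted and actual values above; a small sketch of the arithmetic:

```python
# Predicted vs. actual median income for Alabama, taken from the outputs above
predicted = 52782.83
actual = 49457.13

# Relative headroom: how far actual income sits below the model's prediction
headroom = (predicted - actual) / actual * 100
print(f"{headroom:.1f}%")  # → 6.7%
```

This 6.7% figure is the "scope for improvement" used in the conclusions later in the report.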
Now that we have a model that can tell us whether a state's median household income can be increased, we must determine which areas need to be improved and in what order. That is exactly what our upcoming analysis will reveal.
In this question, we are devising a methodology to compare living conditions in different states.
# We will be leveraging the data calculated for the CNI during our processing. The data was consolidated into 4 main factors.
# Now we are filtering out the information for Alabama, as it has the lowest education
final_cni_Alabama = final_cni.filter(items=['Alabama'], axis=0)
final_cni_Alabama
cni_df_sum_transposed = final_cni_Alabama.T
cni_df_sum_transposed.columns.name = 'Factors'
#Plotting the CNI
cni_plot = cni_df_sum_transposed
## To add the column with colors
cni_plot["Color"] = np.where(cni_plot["Alabama"]<0, 'red', 'green')
# To plot the bar chart
fig = go.Figure()
fig.add_trace(
go.Bar(name='Factors',
x=cni_plot.index.values,
y=cni_plot['Alabama'].values,
marker_color=cni_plot['Color']))
# To update the title and axis labels of the chart
fig.update_layout(
title={
'text': "CNI for Alabama",
'y':0.9,
'x':0.45,
'xanchor': 'center',
'yanchor': 'top'},
xaxis_title="Factors",
yaxis_title="Difference(%) compared to national value"
)
fig.update_layout(barmode='stack')
fig.show()
The graph above shows the factors in which Alabama is below the national mean. The CNI supports our theory: since the state's income is lower than its potential, there is room for improvement in one or more factors, which need to be considered when looking to improve the lifestyle of the people in a state.
For variable identification, we looked at factors (such as jobs) required by the community by assigning weights to sub-factors (such as public, private, and self-employment) and converting the absolute numbers provided for these factors into proportions so that geographies could be compared. Following that, we developed a dynamic predictive model based on linear regression with a mean absolute percentage error of roughly 18% (about 82% accuracy). This model helps us determine which states have room for improvement in terms of income: if our predicted value is higher than the current average income, we conclude that the state's conditions can be bettered.
The Community Needs Index (CNI) analysis discusses the areas that can be improved within a state. We have brought down 39 columns of data into just 4 major factors. This will help people understand the needs of a state on a broader scope. They also have the opportunity to delve into the data at a sub-factor level.
Alabama was chosen as an example because its income could increase by 6.7%. When we conducted an in-depth analysis, we discovered that while affordable housing was 18% lower than the national average and employment was 17% lower, transportation in the state was less than 43% of the national average.
With this in-depth analysis, we can conclude that the state of Alabama, or any philanthropic organization, should prioritize transportation before addressing other issues.
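That prioritization step can be automated by ranking the CNI factor deficits. A minimal sketch using hypothetical deficit values loosely based on the figures quoted above (the positive Education value is invented for contrast):

```python
import pandas as pd

# Hypothetical CNI factor values for one state (% difference vs. national value);
# the negative entries loosely mirror the Alabama figures quoted above
cni = pd.Series({
    'Affordable Housing': -18.0,
    'Transportation': -43.0,
    'Employment': -17.0,
    'Education': 5.0,
})

# Negative values are deficits; sort them worst-first to set priorities
priorities = cni[cni < 0].sort_values()
print(priorities.index[0])  # → Transportation
```

With the real `final_cni` row for a state, the same two lines would produce an ordered work list for that state.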
Aside from the obvious interest that NGOs and governments have in the CNI, the best clients are for-profit organizations that want to improve their CSR activities. Big organizations take responsibility for giving back to the community, and it is common to see them construct parks, schools, and overhead bridges, and restore landscapes that benefit the environment, among many other projects.
With this model, any organization can easily understand the community's true needs, focusing on the pain points of each individual at the state level. This assists businesses in achieving accurate results and, as a result, receiving praise, adoration, and brand recognition.
Future iterations of the model and its scope:
Now that we have a model that works at the state level, we can replicate it down to the zip code level. As a result, even smaller organizations and philanthropists can help uplift the communities around them, beyond what they could achieve on their own.
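At the zip code level, the same logic would flag individual zip codes whose actual income trails the model's prediction. A hedged sketch with invented per-zip values (the 5% threshold is an illustrative assumption):

```python
import pandas as pd

# Hypothetical per-zip results: actual vs. model-predicted median income
df = pd.DataFrame({
    'ZipCode': [35004, 35005, 35006],
    'Actual': [55000, 42000, 38000],
    'Predicted': [54000, 47000, 39000],
})

# Flag zip codes whose actual income trails the prediction by more than 5%
df['Gap%'] = (df['Predicted'] - df['Actual']) / df['Actual'] * 100
flagged = df[df['Gap%'] > 5]['ZipCode'].tolist()
print(flagged)  # → [35005]
```

Run against real per-zip predictions, this would give local organizations a short, concrete list of communities with the most headroom.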